fix(oauth): narrow context.lock scope in async_auth_flow#2660
Conversation
…poll requests
Previously the entire `OAuthClientProvider.async_auth_flow` body ran under
`self.context.lock`, including the `yield request` that hands the request
off to httpx. For requests that complete quickly this is fine, but a GET
SSE long-poll holds the lock for the full SSE lifetime — which means any
concurrent POST (e.g. `tools/call`) is blocked waiting for the lock,
producing a ~16s perceived stall on lazy MCP connections that use OAuth.
This commit splits the single coarse lock into purpose-specific scopes:
Phase 1 (context.lock): initialize state, capture protocol_version, and
decide whether a refresh is needed. Short-held; no HTTP I/O.
Phase 2 (refresh_lock, new): single-flight token refresh. The refresh
request `yield` happens outside any lock. A double-check inside
`context.lock` ensures concurrent waiters do not redundantly refresh
after another coroutine completed one.
Phase 3 (no lock): add the auth header and yield the actual request.
GET SSE long-polls and other long-running requests no longer block
unrelated traffic.
Phase 4 (context.lock): 401 / 403 full OAuth re-auth path. Conservatively
kept under lock because this path is rare and its yielded sub-requests
(metadata discovery, registration, token exchange) hit the AS, not the
resource server. A future refactor can narrow this further.
Lock additions:
- `OAuthContext.refresh_lock: anyio.Lock` provides single-flight refresh
so concurrent requests trigger at most one token refresh.
Behavior changes:
- Concurrent requests through the same `OAuthClientProvider` no longer
serialize at the lock. GET SSE long-polls and POSTs now proceed in
parallel.
- Token refresh remains serialized (via `refresh_lock`), preserving the
invariant that only one refresh request is in flight at a time.
- Public API and behavior are otherwise unchanged.
Related upstream issue:
modelcontextprotocol#1326
…ests Two tests in a new TestConcurrentRequestsDoNotDeadlock class exercise the behavior the previous commit fixes: 1. ``test_concurrent_request_not_blocked_by_pending_long_running_request`` drives one async_auth_flow generator to its yield (= simulating a GET SSE long-poll suspended waiting for the next event) and then opens a second concurrent flow on the same provider. The second flow must reach its own yield within a short timeout — i.e., the lock release between Phase 1 and Phase 3 lets it through. Pre-fix, the second generator would block on context.lock indefinitely. 2. ``test_concurrent_token_refresh_is_single_flight`` exercises the refresh_lock single-flight path. A first flow performs the refresh yield; a second flow started after the refresh completes observes the freshly-updated token in Phase 1 and proceeds directly to its own request yield without issuing a second refresh. Also: tighten the refresh_request unbound-after-conditional-write pattern in async_auth_flow so pyright recognizes it as definitely assigned at the yield site (was: derived from a boolean predicate; now: typed as ``httpx.Request | None`` and checked explicitly).
After dropping the function-level `# pragma: no cover` from `_handle_refresh_response` and removing the per-line pragmas from the refactored Phase 2, the strict-no-cover audit identified covered lines still marked pragma'd and surfaced previously-untested branches. Three new tests close the coverage gaps: * `test_refresh_with_failed_status_clears_tokens` — exercises the ``response.status_code != 200`` branch in `_handle_refresh_response` and the `self._initialized = False` reset on refresh failure. * `test_refresh_with_invalid_json_clears_tokens` — exercises the ValidationError branch when the refresh body is not valid JSON. * `test_double_check_inside_refresh_lock_skips_second_refresh` — uses monkeypatch to flip `is_token_valid` between Phase 1 (False) and the double-check inside `refresh_lock` (True), exercising the branch where a second coroutine's refresh is correctly elided. Also: convert the new tests from the legacy Test* class pattern to plain top-level `test_*` functions per AGENTS.md, and drop unneeded per-line `# pragma: no cover` markers in the refactored auth_flow. Coverage report: 100.00% on `src/mcp/client/auth/oauth2.py`, strict-no-cover clean.
4f5e8e3 to
97b4bfa
Compare
On Python 3.10/3.11, coverage.py (sys.settrace backend) misattributes
the False arm of compound boolean predicates inside an async with block
in an async generator, leading to spurious partial-branch warnings on
4 lines in async_auth_flow:
- line 536: `if not is_token_valid() and can_refresh_token():` (Phase 1)
- line 546: same (Phase 2 double-check inside refresh_lock)
- line 548: `if refresh_request is not None:` (re-check skip path)
- line 553: `if not await _handle_refresh_response(...):` (refresh
success path)
Python 3.12+ (sys.monitoring) tracks these branches correctly. Add
`# pragma: no branch` to the 4 lines as a workaround for the legacy
backend only. An inline comment block above the Phase 1 if documents
the rationale.
Also add 2 unit tests that explicitly exercise the affected branches so
the coverage drop is shown to be a tool-precision issue, not missing
tests:
- test_phase1_skips_refresh_when_token_valid:
Phase 1 sees is_token_valid=True and skips Phase 2 entirely.
- test_refresh_success_proceeds_to_phase3_without_resetting_initialized:
_handle_refresh_response returns True (HTTP 200) so the
_initialized reset is skipped and Phase 3 proceeds with the
fresh token.
Local verification (Python 3.10):
pytest tests/client/test_auth.py -n auto + coverage report
→ 99.43% (was 98.30%, BrPart 1, Missing only line 206 which is
outside the test_auth.py scope). Full suite should reach 100%.
97b4bfa to
a09f68e
Compare
|
Re-confirmation + offer of a conflict-free rebase. This fix is needed: #2847 just landed (June 12) as a second, independent production repro of the exact The PR currently shows Conflict-free rebase (author/commits preserved): https://github.com/Bartok9/python-sdk/tree/salvage/2660-oauth-lock-rebase — happy to open it as a fresh PR if @peisuke is unavailable to push, or @peisuke can cherry-pick. The single-flight |
Narrows the lock scope inside
OAuthClientProvider.async_auth_flowso the actual HTTP requestyieldruns outside any lock. Concurrent requests through the same provider — in particular, a POST issued while a GET SSE long-poll is in flight — no longer serialize oncontext.lock. A newrefresh_lockpreserves single-flight semantics for token refresh so the change does not introduce a refresh race.Motivation and Context
OAuthClientProvider.async_auth_flowpreviously heldself.context.lockfor the entire flow body, including theresponse = yield requestthat hands the request off to httpx. For requests that complete quickly this is fine, but a Streamable HTTP transport opens a long-lived GET SSE stream for server-initiated messages alongside the POST stream for client-initiated RPCs. Both go through the sameOAuthClientProviderinstance, so the GET SSE long-poll'syield requestholdscontext.lockfor the entire SSE lifetime. Any concurrenttools/callPOST then blocks on the lock until the SSE returns, producing a multi-second stall in OAuth-authenticated MCP connections.This commit splits the single coarse lock into purpose-specific scopes:
context.lock): initialize state, captureprotocol_version, and decide whether a refresh is needed. Short-held; no HTTP I/O.refresh_lock, new): single-flight token refresh. The refreshyieldhappens outside any lock. A double-check insidecontext.lockensures concurrent waiters do not redundantly refresh after another coroutine completed one.context.lock): 401 / 403 full OAuth re-auth path. Conservatively kept under lock because this path is rare and its yielded sub-requests target the AS, not the resource server. A future change can narrow this further with the same pattern.refresh_lockis added toOAuthContext(alongside the existinglock) so concurrent expirations trigger at most one refresh request. The double-check pattern insiderefresh_lockmakes the second waiter observe the freshly-updated token and skip its own refresh.How Has This Been Tested?
Five new tests in
tests/client/test_auth.pyexercise the refactor:test_concurrent_request_not_blocked_by_pending_long_running_request— the core regression test. Drives oneasync_auth_flowgenerator to its yield (simulating a GET SSE long-poll), then opens a second flow on the same provider and asserts it reaches its own yield withinanyio.fail_after(5). Pre-fix this hangs indefinitely.test_refresh_lock_double_check_skips_redundant_refresh— a flow started after another finished its refresh observes the fresh token in Phase 1 and skips Phase 2 entirely; no second refresh request is sent.test_double_check_inside_refresh_lock_skips_second_refresh— monkeypatchesis_token_validto return False in Phase 1 and True in the Phase 2 double-check, exercising the branch where a second coroutine's refresh is correctly elided insiderefresh_lock.test_refresh_with_failed_status_clears_tokens— covers the failed-refresh path (status_code != 200) including the_initialized = Falsereset.test_refresh_with_invalid_json_clears_tokens— covers theValidationErrorbranch when the refresh response body is malformed../scripts/testis clean: 1177 passed / 98 skipped / 1 xfailed, 100.00% coverage onsrc/mcp/client/auth/oauth2.py,strict-no-coverreports no leftover# pragma: no coverviolations. Four previously-needed pragmas in_handle_refresh_responseand the Phase 2 yield are now covered by the new tests and removed.Breaking Changes
None. Public API is unchanged. Concurrency semantics narrow only — previously-serialized requests now proceed in parallel; token refresh remains serialized via
refresh_lock.Types of changes
Checklist
Additional context
Targets
main(v2 development). The same bug exists onv1.x; if a backport is wanted, that can land as a follow-up PR.Fixes #1326